73 research outputs found

    Window-Slicing Techniques Extended to Spanning-Event Streams

    Streaming systems often use slices to share computation costs among overlapping windows. However, they are limited to instantaneous events, where a single point in time represents the event. Here, we extend streams to events that come with a duration, denoted as spanning events. After a short review of the new constraints that event lifespans introduce in a temporal sliding-window context, we propose a new structure for dealing with slices in such an environment, and prove that our technique is both correct and effective for handling spanning events.
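
To make the slicing idea concrete, here is a minimal Python sketch. It is not the structure proposed in the paper, which the abstract does not describe. It assumes a count aggregate, a window range of 4 time units, slices of width 2, and hypothetical function names. Point events fall into exactly one slice, so a window result can be assembled from per-slice partial counts. A spanning event can overlap several slices, which is why adding it naively to each slice would count it more than once.

```python
SLICE_WIDTH = 2   # assumed slice width (e.g. gcd of window range and slide)
WINDOW_RANGE = 4  # assumed window length, a multiple of SLICE_WIDTH


def slice_index(t):
    """Index of the slice that contains timestamp t."""
    return t // SLICE_WIDTH


def slice_counts_point(timestamps):
    """Count point events per slice; each event lands in exactly one slice."""
    counts = {}
    for t in timestamps:
        idx = slice_index(t)
        counts[idx] = counts.get(idx, 0) + 1
    return counts


def window_count(counts, window_start):
    """Count for [window_start, window_start + WINDOW_RANGE), built from slices.

    window_start is assumed to be aligned on a slice boundary.
    """
    first = slice_index(window_start)
    last = slice_index(window_start + WINDOW_RANGE - 1)
    return sum(counts.get(i, 0) for i in range(first, last + 1))


def slices_overlapped(start, end):
    """Slices touched by a spanning event with closed lifespan [start, end]."""
    return list(range(slice_index(start), slice_index(end) + 1))


point_events = [0, 1, 3, 5, 6]
counts = slice_counts_point(point_events)
first_window = window_count(counts, 0)   # events at 0, 1, 3
second_window = window_count(counts, 2)  # events at 3, 5
touched = slices_overlapped(1, 4)        # one event, several slices
```

In this sketch, the two overlapping windows reuse the partial count of the slice they share, while the single spanning event with lifespan [1, 4] touches three slices and therefore needs extra bookkeeping to be counted once per window.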

    An algebraic approach to ensemble clustering

    In clustering, consensus clustering aims at providing a single partition that fits a consensus from a set of independently generated clusterings. Common procedures, which are mainly statistical and graph-based, are recognized for their robustness and ability to scale up. In this paper, we provide a complementary and original viewpoint on consensus clustering, by means of algebraic definitions that make it possible to ascertain the nature of the available inferences in a systematic approach (e.g. a knowledge base). We base our approach on the lattice of partitions, and show how operators can be added to it in order to express a formula representing the consensus. We show that adopting an incremental approach may help retain a significant amount of aggregate data that fits well with the set of input clusterings. Beyond this ability to model formulae, we also note that the potential of the approach cannot be easily captured through such a logical system. This is due to the volatile nature of handling partitions, which ultimately limits the ability to draw valuable conclusions.
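
For readers unfamiliar with the lattice of partitions, the following Python sketch shows its two standard operations: the meet (coarsest common refinement) and the join (finest common coarsening). These are textbook lattice operations. The additional operators introduced in the paper are not given in the abstract and are not reproduced here, and the function names are hypothetical.

```python
from itertools import product


def partition(*blocks):
    """Build a partition as a set of frozensets."""
    return {frozenset(b) for b in blocks}


def meet(p, q):
    """Coarsest common refinement: the non-empty pairwise block intersections."""
    intersections = (a & b for a, b in product(p, q))
    return {block for block in intersections if block}


def join(p, q):
    """Finest common coarsening: merge blocks that share at least one element."""
    merged = []
    for block in list(p) + list(q):
        block = set(block)
        disjoint = []
        for m in merged:
            if m & block:
                block |= m          # absorb every block that overlaps
            else:
                disjoint.append(m)
        merged = disjoint + [block]
    return {frozenset(b) for b in merged}


p = partition({1, 2}, {3, 4}, {5})
q = partition({1, 2, 3}, {4}, {5})
refinement = meet(p, q)
coarsening = join(p, q)
```

Here the meet keeps only the groupings on which both input clusterings agree, while the join groups together any elements that either clustering links, directly or transitively.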

    Summary Management in P2P Systems

    Sharing huge, massively distributed databases in P2P systems is inherently difficult. As the amount of stored data increases, data localization techniques are no longer sufficient. A practical approach is to rely on compact database summaries rather than raw database records, whose access is costly in large P2P systems. In this paper, we consider summaries that are synthetic, multidimensional views with two main virtues. First, they can be directly queried and used to approximately answer a query without exploring the original data. Second, as semantic indexes, they support locating relevant nodes based on data content. Our main contribution is to define a summary model for P2P systems, and the appropriate algorithms for summary management. Our performance evaluation shows that the cost of query routing is minimized, while incurring a low cost of summary maintenance.
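
The role of a summary as a semantic index can be illustrated with a small Python sketch. It deliberately swaps the paper's multidimensional summaries for a much simpler stand-in: each peer advertises a (min, max) range per attribute, and a range query is forwarded only to peers whose ranges overlap it. The peer names, attributes, and functions are hypothetical, and the paper's actual summary model is not shown in the abstract.

```python
def overlaps(a, b):
    """True when closed intervals a = (lo, hi) and b = (lo, hi) intersect."""
    return a[0] <= b[1] and b[0] <= a[1]


def relevant_peers(summaries, query):
    """Peers whose summary overlaps the query on every queried attribute."""
    return sorted(
        peer
        for peer, summary in summaries.items()
        if all(
            attr in summary and overlaps(summary[attr], rng)
            for attr, rng in query.items()
        )
    )


# Assumed per-peer summaries: attribute -> (min, max) over the peer's local data.
summaries = {
    "p1": {"age": (20, 35), "salary": (30, 50)},
    "p2": {"age": (40, 60), "salary": (45, 90)},
    "p3": {"age": (30, 45), "salary": (10, 25)},
}
query = {"age": (32, 42), "salary": (40, 60)}
targets = relevant_peers(summaries, query)
```

In this toy setting the query is routed to two of the three peers without touching any raw records, which is the kind of saving a summary-based index is meant to provide.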

    Decision Support to Crowdsourcing for Annotation and Transcription of Ancient Documents: The RECITAL Workshop

    In the 18th century in Paris, only two public theatres could officially perform comedies: the Comédie-Française and the Comédie-Italienne. The latter was much less well known. By studying a century of accounting registers, we aim to learn more about its successful plays, its actors, musicians, set designers, and all the small trades necessary for its operation, its administration, logistics and finances. To this end, we employ a mass of untapped and unpublished resources, the 27,544 pages of 63 daily registers available at the Bibliothèque Nationale de France (BnF), and we take a decidedly fresh look at emerging forms of creation and changes in the entertainment economy. We developed the crowdsourcing platform RECITAL to collect and index the data from the registers, following an emerging trend in Digital Humanities. RECITAL is built upon the ScribeAPI framework and offers a fully-fledged web application to classify the pages, annotate them with marks and tags, transcribe the indexed marks, and even verify previous transcripts. We also describe a multi-level data model and develop a series of monitoring and decision tools to support crowdsourced data management up to the data's definitive form.

    Joining Distributed Database Summaries

    The database summarization system called SaintEtiQ provides multi-level summaries of tabular data stored in a centralized database. Summaries are computed online with a conceptual hierarchical clustering algorithm. However, in many companies, data are distributed among several sites, either homogeneously (i.e., sites contain data for a common set of features) or heterogeneously (i.e., sites contain data for different features). Consequently, the current centralized version of SaintEtiQ is either infeasible or undesirable due to privacy or resource issues. In this paper, we propose two new algorithms for summarizing heterogeneously distributed data without a prior "unification" of the data sources: Subspace-Oriented Join Algorithm (SOJA) and Tree Alignement-based Join Algorithm (TAJA). The main idea of these algorithms is to apply innovative joins to two local models, computed over two disjoint sets of features, to provide a global summary over the full feature set without scanning the raw data. SOJA takes one of the two input trees as the base model and processes the other one to complete it, whereas TAJA rearranges summaries by levels in a top-down manner. Then, we propose a consistent quality measure to quantify how good our joined hierarchies are. Finally, an experimental study using synthetic data sets shows that our joining processes (SOJA and TAJA) result in high-quality clustering schemas of the entire distributed data and are very efficient in terms of computational time compared with the centralized approach.

    Cluster-based Search Technique for P2P Systems

    We consider network clustering as a way to improve the performance of locating data in unstructured P2P systems. Connectivity-based Distributed node Clustering (CDC) and SCM-based Distributed Clustering (SDC) are two major protocols that partition a network topology into clusters based on node connectivity. These protocols focus on the accuracy of the clustering scheme, i.e. using the Scale Coverage Measure (SCM), and on its maintenance in the face of node dynamicity. However, they do not propose search techniques that take advantage of their clustering information. Thus, their proposals have not been evaluated against the motivation behind them. In this work, we propose a new, efficient Cluster-based Search Technique (CBST) for unstructured P2P systems. We use it to validate connectivity-based clustering schemes according to the trade-off between the cost of maintaining clusters and the benefit for query processing. Our experimental results show the efficiency of CBST implemented over the SDC protocol. By simply exploiting clustering features of the underlying network, a query can travel across a large number of nodes with a minimum number of messages. CBST eliminates a large portion of redundant messages, thus avoiding overloading the P2P network.

    Design of PeerSum: a Summary Service for P2P Applications

    Sharing huge databases in distributed systems is inherently difficult. As the amount of stored data increases, data localization techniques are no longer sufficient. A more efficient approach is to rely on compact database summaries rather than raw database records, whose access is costly in large distributed systems. In this paper, we propose PeerSum, a new service for managing summaries over shared data in large P2P and Grid applications. Our summaries are synthetic, multidimensional views with two main virtues. First, they can be directly queried and used to approximately answer a query without exploring the original data. Second, as semantic indexes, they support locating relevant nodes based on data content. Our main contribution is to define a summary model for P2P systems, and the algorithms for summary management. Our performance evaluation shows that the cost of query routing is minimized, while incurring a low cost of summary maintenance.

    Peersum : Gestion des résumés de données dans les systèmes P2P (PeerSum: Managing Data Summaries in P2P Systems)

    Base de Données Avancées (BDA). Sharing huge, massively distributed databases in P2P systems is inherently difficult. As the amount of stored data increases, data localization techniques are no longer sufficient. A practical approach is to rely on compact database summaries rather than raw database records, whose access is costly in large P2P systems. In this paper, we consider summaries that are synthetic, multidimensional views with two main virtues. First, they can be directly queried and used to approximately answer a query without exploring the original data. Second, as semantic indexes, they support locating relevant nodes based on data content. The main contribution of this paper is to define an efficient algorithm for partitioning an unstructured P2P network into domains, in order to optimally distribute summaries in the network. Then, we propose a distributed algorithm for maintaining a summary in a given domain. Our performance evaluation shows that the cost of query routing is minimized, while incurring a low cost of summary maintenance.